Logical Segmentation of Source Code
Many software analysis methods have come to rely on machine learning
approaches. Code segmentation - the process of decomposing source code into
meaningful blocks - can augment these methods by featurizing code, reducing
noise, and limiting the problem space. Traditionally, code segmentation has
been done using syntactic cues; current approaches do not intentionally capture
logical content. We develop a novel deep learning approach to generate logical
code segments regardless of the language or syntactic correctness of the code.
Due to the lack of logically segmented source code, we introduce a unique data
set construction technique to approximate ground truth for logically segmented
code. Logical code segmentation can improve tasks such as automatically
commenting code, detecting software vulnerabilities, repairing bugs, labeling
code functionality, and synthesizing new code.
Comment: SEKE2019 Conference Full Paper
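The syntactic baseline the abstract contrasts with can be made concrete: a purely syntactic segmenter splits code into blocks at surface cues such as blank lines, with no notion of logical content. A minimal sketch of that baseline (not the paper's deep learning model):

```python
def syntactic_segments(source):
    """Split source code into blocks at blank lines -- a purely
    syntactic cue, blind to the logical content of each block."""
    segments, current = [], []
    for line in source.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

code = "x = 1\ny = 2\n\nprint(x + y)\n"
# Two blank-line-delimited blocks, regardless of whether they are
# logically related -- the limitation the paper's approach addresses.
blocks = syntactic_segments(code)
```

A logical segmenter would instead be free to merge or split these blocks based on meaning, independent of language or syntactic correctness.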
That Escalated Quickly: An ML Framework for Alert Prioritization
In place of in-house solutions, organizations are increasingly moving towards
managed services for cyber defense. Security Operations Centers are specialized
cybersecurity units responsible for the defense of an organization, but the
large-scale centralization of threat detection is causing SOCs to endure an
overwhelming amount of false positive alerts -- a phenomenon known as alert
fatigue. Large collections of imprecise sensors, an inability to adapt to known
false positives, evolution of the threat landscape, and inefficient use of
analyst time all contribute to the alert fatigue problem. To combat these
issues, we present That Escalated Quickly (TEQ), a machine learning framework
that reduces alert fatigue with minimal changes to SOC workflows by predicting
alert-level and incident-level actionability. On real-world data, the system
reduces the time it takes to respond to actionable incidents, suppresses false
positives while preserving the detection rate, and reduces the number of alerts
an analyst needs to investigate within singular incidents.
Comment: Submitted to Usenix Security Symposium
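The workflow such a framework targets can be sketched at a high level: score each alert with a predicted probability of actionability, surface high-scoring alerts first, and suppress likely false positives. The interface below is hypothetical — the `Alert` fields, model, and threshold are illustrative assumptions, not TEQ's actual design:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    alert_id: str
    score: float  # hypothetical model-predicted probability of actionability

def triage(alerts, threshold=0.5):
    """Rank alerts by predicted actionability and suppress those below
    the threshold, so analysts see likely-actionable alerts first."""
    kept = sorted((a for a in alerts if a.score >= threshold),
                  key=lambda a: a.score, reverse=True)
    suppressed = [a for a in alerts if a.score < threshold]
    return kept, suppressed

kept, suppressed = triage([Alert("a1", 0.9), Alert("a2", 0.2), Alert("a3", 0.7)])
```

The appeal of this shape is that it changes only the ordering and volume of the alert queue, not the SOC workflow itself, which matches the abstract's "minimal changes to SOC workflows" framing.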
A Language-Agnostic Model for Semantic Source Code Labeling
Code search and comprehension have become more difficult in recent years due
to the rapid expansion of available source code. Current tools lack a way to
label arbitrary code at scale while maintaining up-to-date representations of
new programming languages, libraries, and functionalities. Comprehensive
labeling of source code enables users to search for documents of interest and
obtain a high-level understanding of their contents. We use Stack Overflow code
snippets and their tags to train a language-agnostic, deep convolutional neural
network to automatically predict semantic labels for source code documents. On
Stack Overflow code snippets, we demonstrate a mean area under ROC of 0.957
over a long-tailed list of 4,508 tags. We also manually validate the model
outputs on a diverse set of unlabeled source code documents retrieved from
GitHub, and we obtain a top-1 accuracy of 86.6%. This strongly indicates that
the model successfully transfers its knowledge from Stack Overflow snippets to
arbitrary source code documents.
Comment: MASES 2018 Publication
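The headline metric — mean area under the ROC curve across a large tag set — can be computed per tag and then macro-averaged. A self-contained sketch; the rank-statistic AUC below is a standard formulation, not code from the paper:

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs where the positive scores higher
    (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        return float("nan")  # AUC undefined for a one-class tag
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc(per_tag_labels, per_tag_scores):
    """Macro-average AUC over tags, as in a multi-label evaluation
    spanning a long-tailed tag vocabulary."""
    aucs = [roc_auc(y, s) for y, s in zip(per_tag_labels, per_tag_scores)]
    return sum(aucs) / len(aucs)
```

Macro-averaging weights every tag equally, so rare tags in the long tail count as much as common ones — one plausible reading of "mean area under ROC ... over a long-tailed list of 4,508 tags".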
Stan: A Probabilistic Programming Language
Stan is a probabilistic programming language for specifying statistical models. A Stan program imperatively defines a log probability function over parameters conditioned on specified data and constants. As of version 2.14.0, Stan provides full Bayesian inference for continuous-variable models through Markov chain Monte Carlo methods such as the No-U-Turn sampler, an adaptive form of Hamiltonian Monte Carlo sampling. Penalized maximum likelihood estimates are calculated using optimization methods such as the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm. Stan is also a platform for computing log densities and their gradients and Hessians, which can be used in alternative algorithms such as variational Bayes, expectation propagation, and marginal inference using approximate integration. To this end, Stan is set up so that the densities, gradients, and Hessians, along with intermediate quantities of the algorithm such as acceptance probabilities, are easily accessible. Stan can be called from the command line using the cmdstan package, through R using the rstan package, and through Python using the pystan package. All three interfaces support sampling and optimization-based inference with diagnostics and posterior analysis. rstan and pystan also provide access to log probabilities, gradients, Hessians, parameter transforms, and specialized plotting.
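What a Stan program denotes — a log probability function over parameters conditioned on data, plus its gradient for Hamiltonian Monte Carlo — can be illustrated on a toy model. The sketch below hand-codes the log posterior and gradient for a normal-mean model with a normal prior; in Stan these would be declared in `data`, `parameters`, and `model` blocks and differentiated automatically via autodiff:

```python
def log_posterior(mu, data, sigma=1.0, prior_sd=10.0):
    """Log p(mu | data) up to an additive constant. Mirrors a Stan
    model block declaring:  mu ~ normal(0, prior_sd);
                            y  ~ normal(mu, sigma);"""
    lp = -0.5 * (mu / prior_sd) ** 2                          # prior term
    lp += sum(-0.5 * ((y - mu) / sigma) ** 2 for y in data)   # likelihood terms
    return lp

def grad_log_posterior(mu, data, sigma=1.0, prior_sd=10.0):
    """Analytic d(log posterior)/d(mu) -- the quantity HMC/NUTS consumes.
    Stan derives this automatically; here it is written by hand."""
    return -mu / prior_sd ** 2 + sum((y - mu) / sigma ** 2 for y in data)

data = [1.0, 2.0, 3.0]
sigma, prior_sd = 1.0, 10.0
# The gradient vanishes at the posterior mode (the penalized MLE):
mode = (sum(data) / sigma ** 2) / (len(data) / sigma ** 2 + 1 / prior_sd ** 2)
```

This is the sense in which Stan is "a platform for computing log densities and their gradients": samplers, optimizers, and alternative algorithms all operate on exactly this pair of functions.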
How to Change the Weight of Rare Events in Decisions from Experience
When making risky choices, two kinds of information are crucial: outcome values and outcome probabilities. Here, we demonstrate that the juncture at which value and probability information is provided has a fundamental effect on choice. Across four experiments involving 489 participants, we compare two decision-making scenarios: one where value information is revealed during sampling (Standard), and one where value information is revealed after sampling (Value-Ignorance). On average, participants made riskier choices when value information was provided after sampling. Moreover, parameter estimates from a hierarchical Bayesian implementation of cumulative prospect theory suggested that participants overweighted rare events when value information was absent during sampling, but showed no overweighting in the Standard condition. This suggests that the impact of rare events on choice relies crucially on the timing of probability and value integration. We provide paths towards mechanistic explanations of our results based on frameworks which assume different underlying cognitive architectures.
Possible Disintegrating Short-Period Super-Mercury Orbiting KIC 12557548
We report here on the discovery of stellar occultations, observed with
Kepler, that recur periodically at 15.685 hour intervals, but which vary in
depth from a maximum of 1.3% to a minimum that can be less than 0.2%. The star
that is apparently being occulted is KIC 12557548, a K dwarf with T_eff = 4400
K and V = 16. Because the eclipse depths are highly variable, they cannot be
due solely to transits of a single planet with a fixed size. We discuss but
dismiss a scenario involving a binary giant planet whose mutual orbit plane
precesses, bringing one of the planets into and out of a grazing transit. We
also briefly consider an eclipsing binary, that either orbits KIC 12557548 in a
hierarchical triple configuration or is nearby on the sky, but we find such a
scenario inadequate to reproduce the observations. We come down in favor of an
explanation that involves macroscopic particles escaping the atmosphere of a
slowly disintegrating planet not much larger than Mercury. The particles could
take the form of micron-sized pyroxene or aluminum oxide dust grains. The
planetary surface is hot enough to sublimate and create a high-Z atmosphere;
this atmosphere may be loaded with dust via cloud condensation or explosive
volcanism. Atmospheric gas escapes the planet via a Parker-type thermal wind,
dragging dust grains with it. We infer a mass loss rate from the observations
of order 1 M_E/Gyr, with a dust-to-gas ratio possibly of order unity. For our
fiducial 0.1 M_E planet, the evaporation timescale may be ~0.2 Gyr. Smaller
mass planets are disfavored because they evaporate still more quickly, as are
larger mass planets because they have surface gravities too strong to sustain
outflows with the requisite mass-loss rates. The occultation profile evinces an
ingress-egress asymmetry that could reflect a comet-like dust tail trailing the
planet; we present simulations of such a tail.
Comment: 14 pages, 7 figures; submitted to ApJ, January 10, 2012; accepted
March 21, 2012
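The quoted evaporation timescale follows from order-of-magnitude arithmetic, t ~ M / Ṁ, using the fiducial values in the abstract. The factor-of-two gap between this estimate and the quoted ~0.2 Gyr simply reflects the order-of-magnitude nature of the inferred mass-loss rate:

```python
# Order-of-magnitude evaporation timescale, t ~ M / Mdot, with the
# abstract's fiducial values: a 0.1 Earth-mass planet losing mass at
# roughly 1 Earth mass per Gyr.
mass_ME = 0.1          # fiducial planet mass [Earth masses]
mdot_ME_per_Gyr = 1.0  # inferred mass-loss rate [Earth masses / Gyr]
t_Gyr = mass_ME / mdot_ME_per_Gyr  # ~0.1 Gyr, same order as the quoted ~0.2 Gyr
```

The same arithmetic explains why smaller planets are disfavored: halving the mass at a fixed loss rate halves the survival time, making it improbable that we catch the planet so late in its evaporation.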
Systematizing Confidence in Open Research and Evidence (SCORE)
Assessing the credibility of research claims is a central, continuous, and laborious part of the scientific process. Credibility assessment strategies range from expert judgment to aggregating existing evidence to systematic replication efforts. Such assessments can require substantial time and effort. Research progress could be accelerated if there were rapid, scalable, accurate credibility indicators to guide attention and resource allocation for further assessment. The SCORE program is creating and validating algorithms to provide confidence scores for research claims at scale. To investigate the viability of scalable tools, teams are creating: a database of claims from papers in the social and behavioral sciences; expert and machine-generated estimates of credibility; and evidence of reproducibility, robustness, and replicability to validate the estimates. Beyond the primary research objective, the data and artifacts generated from this program will be openly shared and provide an unprecedented opportunity to examine research credibility and evidence.
A New Approach for Assessment of Mental Architecture: Repeated Tagging
A new approach to the study of a relatively neglected property of mental architecture—whether and when the already-processed elements are separated from the to-be-processed elements—is proposed. The process of numerical proportion discrimination between two sets of elements defined either by color or by orientation can be described as sampling with or without replacement (characterized by binomial or hypergeometric probability distributions, respectively), depending on the possibility to tag an element once or repeatedly. All empirical psychometric functions were approximated by a theoretical model showing that the ability to keep track of the already tagged elements is not an inflexible part of the mental architecture but rather an individually variable strategy which also depends on the conspicuity of perceptual attributes. Strong evidence is provided that in a considerable number of trials, observers tagged the same element repeatedly, which can only be done serially at two separate time moments.
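The sampling distinction at the heart of this approach maps directly onto two counting distributions: drawing with replacement (an element may be tagged repeatedly) gives a binomial count of marked elements, while drawing without replacement (already-tagged elements are excluded) gives a hypergeometric count. A sketch with hypothetical set sizes, not the paper's stimuli:

```python
from math import comb

def binom_pmf(k, n, K, N):
    """P(k marked elements in n draws WITH replacement) from a display
    of N elements of which K are marked -- repeated tagging allowed."""
    p = K / N
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def hypergeom_pmf(k, n, K, N):
    """P(k marked elements in n draws WITHOUT replacement) -- each
    element can be tagged at most once."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)
```

Comparing observed response distributions against these two predictions is one way to infer, per observer, whether already-tagged elements were excluded from further sampling.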